A new approach to describe the taxonomic structure of microbiome and its application to assess the relationship between microbial niches

Background
Data from microbiomes from multiple niches is often collected, but methods to analyse these often ignore associations between niches. One interesting case is that of the oral microbiome. Its composition is receiving increasing attention due to reports on its associations with general health. While the oral cavity includes different niches, multi-niche microbiome data analysis is conducted using a single niche at a time and, therefore, ignores other niches that could act as confounding variables. Understanding the interaction between niches would assist interpretation of the results, and help improve our understanding of multi-niche microbiomes.

Methods
In this study, we used a machine learning technique called latent Dirichlet allocation (LDA) on two microbiome datasets consisting of several niches. LDA was used on both individual niches and all niches simultaneously. On individual niches, LDA was used to decompose each niche into bacterial sub-communities unveiling their taxonomic structure. These sub-communities were then used to assess the relationship between microbial niches using the global test. On all niches simultaneously, LDA allowed us to extract meaningful microbial patterns. Sets of co-occurring operational taxonomic units (OTUs) comprising those patterns were then used to predict the original location of each sample.

Results
Our approach showed that the per-niche sub-communities displayed a strong association between supragingival plaque and saliva, as well as between the anterior and posterior tongue. In addition, the LDA-derived microbial signatures were able to predict the original sample niche illustrating the meaningfulness of our sub-communities. For the multi-niche oral microbiome dataset we had an overall accuracy of 76%, and per-niche sensitivity of up to 83%. Finally, for a second multi-niche microbiome dataset from the entire body, microbial niches from the oral cavity displayed stronger associations to each other than with those from other parts of the body, such as niches within the vagina and the skin.

Conclusion
Our LDA-based approach produces sets of co-occurring taxa that can describe niche composition. LDA-derived microbial signatures can also be instrumental in summarizing microbiome data, for both descriptions as well as prediction.

Supplementary Information
The online version contains supplementary material available at 10.1186/s12859-023-05575-8.

This niche is mainly composed of sub-community 4. Sub-community 3 is also present in a smaller proportion.
This niche is mainly composed of sub-communities 1, 2 and 5.

Excluded sample IDs in the HMP dataset
This section presents the list of excluded IDs in the Human Microbiome Project dataset. Those samples present a dubious oral microbiome composition since they had a large proportion of OTU001 Propionibacterium, a bacterium very common in the skin microbiome but usually very rare in the oral microbiome. All samples coming from the following IDs were therefore excluded from the entire analysis.

Evaluation of the predictive model
This section gives more information on the predictive model we ran (section 3.1). We aimed to predict the original niche (from the Dutch Oral Dataset) of the samples based on their sub-community composition. This helps us assess the meaningfulness of our sub-communities.
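Per-niche metrics such as sensitivity, specificity and balanced accuracy can be derived from a confusion matrix in a one-vs-rest fashion. The sketch below is illustrative only: the niche labels and counts are invented toy values, and the function names `per_class_metrics` and `overall_accuracy` are hypothetical, not the code used in the study.

```python
def per_class_metrics(confusion, labels):
    """One-vs-rest metrics from a square confusion matrix.

    confusion[i][j] is the number of samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in confusion)
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]                          # true positives
        fn = sum(confusion[k]) - tp                   # missed samples of class k
        fp = sum(row[k] for row in confusion) - tp    # other classes predicted as k
        tn = total - tp - fn - fp                     # everything else
        sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        metrics[label] = {
            "sensitivity": sensitivity,
            "specificity": specificity,
            "balanced_accuracy": (sensitivity + specificity) / 2,
        }
    return metrics


def overall_accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)


# Toy example with made-up counts (rows: true niche, columns: predicted niche).
toy_labels = ["saliva", "tongue", "plaque"]
toy_confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
toy_metrics = per_class_metrics(toy_confusion, toy_labels)
toy_accuracy = overall_accuracy(toy_confusion)
```

Reporting the per-class values alongside the overall accuracy shows which niches are confused with each other, which a single accuracy figure hides.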

Insights about salivary sub-communities
In this section we link our salivary sub-communities with a set of covariates. We performed Spearman correlation tests (for numerical covariates) and Wilcoxon tests (for categorical covariates) between the per-sample sub-community proportions and a set of covariates. This allows us to obtain a more detailed picture than using raw OTUs, since in our case the same OTU can have a different biological interpretation due to the other species in the sub-community.
For example, we can see in Figure S 13 that, even though sub-communities 3 and 4 have similar taxonomic composition, sub-community 4 is significantly correlated with many covariates while sub-community 3 is not.
Those two sub-communities are indeed both composed of a Prevotella-Streptococcus-Veillonella combination, but the presence of other OTUs in each sub-community led to different results.
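The two kinds of test described above can be sketched as follows. This is a minimal illustration on synthetic data, not the analysis code of the study: the variable names and values are invented, the categorical covariate is assumed to have two levels, and the Wilcoxon test is taken to be the rank-sum (Mann-Whitney) variant for two independent groups.

```python
import numpy as np
from scipy.stats import mannwhitneyu, spearmanr

rng = np.random.default_rng(0)
n_samples = 60

# Hypothetical proportion of one sub-community in each saliva sample.
proportion = rng.beta(2, 5, size=n_samples)

# Hypothetical numerical covariate, built to decrease with the proportion.
ph = 7.2 - 1.5 * proportion + rng.normal(0, 0.1, size=n_samples)

# Hypothetical two-level categorical covariate (alternating group labels).
group = np.arange(n_samples) % 2 == 0

# Numerical covariate: Spearman rank correlation test.
rho, p_numeric = spearmanr(proportion, ph)

# Categorical covariate: Wilcoxon rank-sum (Mann-Whitney U) test between groups.
u_stat, p_categorical = mannwhitneyu(
    proportion[group], proportion[~group], alternative="two-sided"
)

# A heatmap of test results would display -log10 of these p-values.
neg_log10_p = -np.log10([p_numeric, p_categorical])
```

When many sub-community and covariate pairs are tested in this way, a multiple-testing correction would normally be considered before the p-values are interpreted.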
Finally, we can point out that the second sub-community is highly correlated with lysozyme and with a lower salivary pH.

7 Simulation study and fixing the number of sub-communities

Simulation study
In this section, we present the output of our simulation study. We simulated a total of 20 different datasets using the LDA generative process. The latent matrices we used as input were the ones obtained by applying the LDA function to our real dataset. Then, we applied the dmn function to check whether the number of sub-communities proposed by dmn matched the true number of sub-communities used to generate our simulated datasets (K = 5). The number of estimated sub-communities (K_est) clearly depends on the number of included samples. The dmn function indeed overestimates the true number of sub-communities if the number of samples is too high. Therefore, we are confident that our choice to use a lower number of sub-communities than the one proposed by dmn when applying LDA to all niches simultaneously was relevant.
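The LDA generative process used for such a simulation can be sketched as below. This is a minimal illustration under stated assumptions: the latent matrices `theta` (per-sample sub-community proportions) and `beta` (per-sub-community OTU distributions) are drawn from Dirichlet distributions as stand-ins, whereas the study used matrices estimated from the real dataset. The function name, the matrix sizes and the number of reads per sample are hypothetical.

```python
import numpy as np


def simulate_lda_counts(theta, beta, reads_per_sample, rng):
    """Simulate an OTU count table from the LDA generative process.

    theta: (n_samples, K) sub-community proportions of each sample.
    beta:  (K, n_otus) OTU distribution of each sub-community.
    """
    n_samples, n_topics = theta.shape
    n_otus = beta.shape[1]
    counts = np.zeros((n_samples, n_otus), dtype=int)
    for i in range(n_samples):
        # Assign each read of sample i to a sub-community.
        reads_per_topic = rng.multinomial(reads_per_sample, theta[i])
        for k in range(n_topics):
            # Draw the OTU of each read from its sub-community.
            counts[i] += rng.multinomial(reads_per_topic[k], beta[k])
    return counts


rng = np.random.default_rng(42)
K, n_otus, n_samples = 5, 30, 20  # K = 5 as in the text; other sizes are toy values

# Stand-in latent matrices (the study took these from an LDA fit on real data).
theta = rng.dirichlet(np.full(K, 0.5), size=n_samples)
beta = rng.dirichlet(np.full(n_otus, 0.5), size=K)

counts = simulate_lda_counts(theta, beta, reads_per_sample=1000, rng=rng)
```

Drawing the number of reads per sub-community first and then the OTUs within each sub-community is equivalent to sampling a sub-community and an OTU read by read, but avoids an explicit loop over reads.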

Usual metrics to fix the number of communities
We plot in this section the perplexity and the coherence score as a function of the number of sub-communities for the complete DODA.
These metrics did not help us to fix our number of sub-communities. On the one hand, the coherence score only varied marginally between K = 2 and K = 12. On the other hand, the perplexity plot did not present any clear elbow, and it remained difficult to assess the best K based on that plot.
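A cross-validated perplexity curve of this kind can be sketched as follows. The example is an assumption-laden illustration, not the pipeline of the study: it uses scikit-learn's `LatentDirichletAllocation` rather than the software used by the authors, a random Poisson count table instead of real OTU counts, and a hypothetical helper `cv_perplexity`. It covers perplexity only, not the coherence score.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import KFold


def cv_perplexity(counts, candidate_ks, n_splits=4, seed=0):
    """Mean held-out perplexity for each candidate number of sub-communities."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mean_perplexity = {}
    for k in candidate_ks:
        fold_scores = []
        for train_idx, test_idx in kfold.split(counts):
            # Fit LDA on the training samples only.
            model = LatentDirichletAllocation(n_components=k, random_state=seed)
            model.fit(counts[train_idx])
            # Evaluate on the held-out samples (lower is better).
            fold_scores.append(model.perplexity(counts[test_idx]))
        mean_perplexity[k] = float(np.mean(fold_scores))
    return mean_perplexity


# Toy count table (samples x OTUs) with no real community structure.
rng = np.random.default_rng(1)
toy_counts = rng.poisson(3.0, size=(40, 25))

scores = cv_perplexity(toy_counts, candidate_ks=[2, 3, 5])
```

Plotting the values of `scores` against K gives the kind of curve in which an elbow is searched for. On data without a clear structure, such as this toy table, no pronounced elbow should be expected.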

LDAvis example output
Here we plot some output from the LDAvis package, which helps us to choose the number of communities. We can see that K = 5 is the highest number of sub-communities for which we do not observe overlap in the two-dimensional visualization.

Figure S 2: Details of the distribution of the sub-communities within the anterior tongue. This niche is mainly composed of sub-community 3. Sub-community 4 is also present in a smaller proportion.

Figure S 3: Details of the distribution of the sub-communities within the posterior tongue. This niche is mainly composed of sub-community 4. Sub-community 3 is also present in a smaller proportion.

Figure S 4: Details of the distribution of the sub-communities within the interproximal plaque. This niche is mainly composed of sub-community 2. Sub-communities 1 and 5 are also present in a smaller proportion.

Figure S 5: Details of the distribution of the sub-communities within the supragingival plaque. This niche is mainly composed of sub-community 5. Sub-community 1 is also present in a smaller proportion.

Figure S 7: PCA plot of the DODA microbiome. The plaque and mucosal samples are split in the same way as in Figure 2 of the main paper.

Figure S 13: Heatmap of the -log10 of the p-values of the correlations between sub-community proportions and other covariates. Some sub-communities are highly correlated with clinical or chemical variables.

Figure S 14: Number of sub-communities proposed by DMM on the LDA-simulated dataset. The true number of sub-communities corresponds to the dotted line.

Figure S 15: Index used by DMM to estimate the best number of sub-communities for n = 1474.

Figure S 16: Computation time needed to fit the DMM model for n varying from 100 to 1474.

Figure S 17: Mean cross-validated perplexity (4-fold cross-validation) as a function of the number of sub-communities.

Figure S 18: Coherence score as a function of the number of sub-communities. We can see that it remains stable regardless of K.

Figure S 20: Output of the LDAvis package for K = 5. We can see that there is no overlap between our sub-communities.

Figure S 21: Output of the LDAvis package for K = 6. Here, sub-communities 2 and 6 present a similar composition.

Figure S 22: Output of the LDAvis package for K = 10. As the number of sub-communities increases, we observe more overlap, and some sub-communities contain very few reads (number 10).

Figure S 23: Output of the LDAvis package for K = 15. As for K = 10, more and more sub-communities have very few reads or overlap with each other.

Class Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Figure S 11: Composition of LDA-derived sub-communities using K = 3 for each skin and stool niche from the HMP dataset. While the sub-communities were built at the OTU level, we plot here the lowest taxonomic level available for each OTU. Only the taxa contributing at least 5% of a sub-community are plotted here.

Figure S 12: Composition of LDA-derived sub-communities using K = 3 for each vaginal niche from the HMP dataset. While the sub-communities were built at the OTU level, we plot here the lowest taxonomic level available for each OTU. Only the taxa contributing at least 5% of a sub-community are plotted here.